Codex/int96 compat default #64882
### What problem does this PR solve?
Issue Number: N/A
Related PR: N/A
Problem Summary: Refactor the file table reader stack around the format_v2 reader implementation. This includes the new file reader abstractions, parquet reader components, table reader adapters for Hive/Iceberg/Paimon/JDBC, ColumnMapper filter and projection handling, expression clone support used by file-local filter rewrites, Iceberg row lineage materialization, and related BE unit tests and design notes. This commit is the squashed result of rebasing the current branch onto master.
### Release note
None
### Check List (For Author)
- Test: No need to test (history rewrite only in this step; no code changes beyond the already rebased branch content)
- Behavior changed: No
- Does this need documentation: No
### What problem does this PR solve?
This PR refactors the new parquet reader complex-column path around schema projection and reader creation. It clarifies `ParquetColumnReaderFactory` recursion, keeps schema projection as a single public helper, normalizes file-local `ColumnDefinition` children to Doris semantic children, and folds the Parquet MAP `key_value/entry` wrapper into the MAP schema node during parquet schema construction. With that shape, MAP and LIST both expose direct semantic children to the table/file reader boundary, and ParquetReader no longer needs a semantic-to-physical projection translation layer for MAP.
### Release note
None
### Check List
- Test: Manual test
- Ran `git diff --check`.
- Tried `./run-be-ut.sh -j 8 --run --filter="ParquetColumnReaderTest.ReadProjectedMapStructValueChildren:ColumnMapperScanRequestTest.MapValueStructProjectionPrunesValueChildren"`, but the local toolchain failed before tests with `ld: library 'c++' not found`.
- Behavior changed: No
- Does this need documentation: No
…ns (apache#64480)
### What problem does this PR solve?
Localize slot-rooted struct element predicates through ColumnMapping so renamed nested fields rewrite both selector names and projected file child return types. Keep computed complex parents and evolved MAP_KEYS-style filters at the table layer instead of generating unsafe file-local casts. Rebuild complex scan projections and rematerialize struct children in table type order, add debug-only block sanity checks with contextual errors, preserve function-call clone state, and handle nullable Iceberg delete columns. Move ColumnMapper coverage into column_mapper_test and add tests for complex child projection, nested predicate projection, map/array fallback, and file-local conjunct localization boundaries.
### Check List (For Author)
- Test
- [ ] Regression test
- [ ] Unit Test
- [ ] Manual test (add detailed scripts or steps below)
- [ ] No need to test or manual test. Explain why:
- [ ] This is a refactor/code format and no logic has been changed.
- [ ] Previous test can cover this change.
- [ ] No code files have been changed.
- [ ] Other reason
- Behavior changed:
- [ ] No.
- [ ] Yes.
- Does this need documentation?
- [ ] No.
- [ ] Yes.
### Check List (For Reviewer who merge this PR)
- [ ] Confirm the release note
- [ ] Confirm test cases
- [ ] Confirm document
- [ ] Add branch pick label
### What problem does this PR solve?
Issue Number: None
Related PR: apache#63893
Problem Summary: Refresh regression expected outputs for the new Parquet INT96 timestamp semantics. The new reader decodes INT96 timestamps without applying the session timezone offset, so affected Asia/Shanghai-based expectations move 8 hours earlier. This commit updates only the expected result files for the INT96 offset cases in p0 and external suites, while leaving unrelated timestamp failures unchanged.
### Release note
None
### Check List (For Author)
- Test: Manual test
- Verified staged diffs are limited to regression expected-output files.
- Verified 634 changed lines are exactly timestamp minus 8 hours, with paimon c2 handled as a selective LTZ/INT96 column update.
- Full regression test not run.
- Behavior changed: No, test expectations only
- Does this need documentation: No
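The 8-hour shift described above can be illustrated with a minimal sketch of INT96 decoding. It assumes the conventional INT96 layout (8 bytes of nanoseconds within the day followed by a 4-byte Julian day, both little-endian) and a little-endian host. The names `Int96Timestamp`, `parse_int96`, `int96_to_epoch_micros`, and `apply_offset_seconds` are hypothetical and are not taken from the Doris code base.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Conventional Parquet INT96 timestamp payload (hypothetical struct name).
struct Int96Timestamp {
    int64_t nanos_of_day; // nanoseconds since midnight
    int32_t julian_day;   // Julian day number
};

constexpr int64_t kJulianDayOfUnixEpoch = 2440588; // 1970-01-01
constexpr int64_t kNanosPerDay = 86400LL * 1000 * 1000 * 1000;
constexpr int64_t kMicrosPerDay = 86400LL * 1000 * 1000;

// Reads the 12 raw bytes; assumes a little-endian host.
inline Int96Timestamp parse_int96(const uint8_t* raw) {
    Int96Timestamp ts{};
    std::memcpy(&ts.nanos_of_day, raw, sizeof(ts.nanos_of_day));
    std::memcpy(&ts.julian_day, raw + 8, sizeof(ts.julian_day));
    return ts;
}

// Converts to microseconds since the Unix epoch with no time zone adjustment,
// which is the behavior the commit message attributes to the new reader.
inline int64_t int96_to_epoch_micros(const Int96Timestamp& ts) {
    assert(ts.nanos_of_day >= 0 && ts.nanos_of_day < kNanosPerDay);
    return (ts.julian_day - kJulianDayOfUnixEpoch) * kMicrosPerDay + ts.nanos_of_day / 1000;
}

// Models the older expectation for comparison: shift by a fixed session offset.
inline int64_t apply_offset_seconds(int64_t epoch_micros, int32_t offset_seconds) {
    return epoch_micros + static_cast<int64_t>(offset_seconds) * 1000000;
}
```

With an Asia/Shanghai offset of 8 * 3600 seconds, the value from `apply_offset_seconds` is exactly 8 hours later than the unadjusted value, which matches the direction of the expected-output changes.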
### What problem does this PR solve?
Issue Number: None
Related PR: apache#63893
Problem Summary: The new Parquet reader rejected MAP schemas whose key field is optional with `Unsupported nullable parquet MAP key`, while the old reader only logged a warning and continued. Some external Parquet writers can emit optional MAP keys, and the v2 schema builder already preserves definition levels and exposes MAP key/value types as nullable. This change removes the schema-level hard rejection for optional MAP keys while keeping the existing structural MAP layout checks.
### Release note
Allow the new Parquet reader to read external Parquet MAP columns with optional key fields.
### Check List (For Author)
- Test: Manual test
- `build-support/clang-format.sh be/src/format_v2/parquet/parquet_column_schema.cpp`
- `git diff --check`
- Behavior changed: Yes, the new Parquet reader no longer rejects optional MAP key schemas.
- Does this need documentation: No
### What problem does this PR solve?
Issue Number: None
Related PR: apache#63893
Problem Summary: RuntimeFilterExpr is a wrapper around the concrete runtime filter predicate, but its normal column execution path returned `Not implement RuntimeFilterExpr::execute_column_impl`. Partition pruning for external tables can evaluate runtime filter wrappers as ordinary expressions, so Hive/Iceberg/Paimon runtime filter partition pruning failed before evaluating the wrapped predicate. This change delegates RuntimeFilterExpr::execute_column_impl to the wrapped implementation. The scan filter path still uses execute_filter, preserving the existing selectivity counters and runtime-filter-specific row filtering behavior.
### Release note
Fix runtime filter wrapper expression execution for external table partition pruning.
### Check List (For Author)
- Test: Manual test
- `build-support/clang-format.sh be/src/exprs/runtime_filter_expr.cpp`
- `git diff --check`
- Behavior changed: Yes, runtime filter wrappers can now be evaluated through the normal expression execution path.
- Does this need documentation: No
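The delegation described above follows the usual wrapper pattern. The sketch below is a simplified stand-in: `Expr`, `GreaterThan`, and `RuntimeFilterWrapper` are hypothetical types, and the real Doris expression interfaces operate on blocks and return a status rather than a `std::vector<bool>`.

```cpp
#include <cassert>
#include <memory>
#include <utility>
#include <vector>

// Simplified expression interface (hypothetical, not the Doris VExpr API).
class Expr {
public:
    virtual ~Expr() = default;
    virtual std::vector<bool> execute_column(const std::vector<int>& column) const = 0;
};

// A concrete predicate standing in for the wrapped runtime filter implementation.
class GreaterThan final : public Expr {
public:
    explicit GreaterThan(int bound) : _bound(bound) {}
    std::vector<bool> execute_column(const std::vector<int>& column) const override {
        std::vector<bool> result;
        result.reserve(column.size());
        for (int value : column) {
            result.push_back(value > _bound);
        }
        return result;
    }

private:
    int _bound;
};

// The wrapper forwards ordinary column execution to the wrapped predicate
// instead of reporting "not implemented".
class RuntimeFilterWrapper final : public Expr {
public:
    explicit RuntimeFilterWrapper(std::shared_ptr<const Expr> impl) : _impl(std::move(impl)) {
        assert(_impl != nullptr);
    }
    std::vector<bool> execute_column(const std::vector<int>& column) const override {
        return _impl->execute_column(column);
    }

private:
    std::shared_ptr<const Expr> _impl;
};
```

A caller such as a partition pruner can then treat the wrapper like any other expression, while a separate filter-specific entry point (not modeled here) can keep its own counters.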
### What problem does this PR solve?
Issue Number: close #xxx
Related PR: apache#63893
Problem Summary: The new parquet reader resolved decimal columns with precision greater than 38 to TYPE_DECIMAL256, but then marked those columns as unsupported for the scalar record reader. This made external parquet scans fail with errors such as "Current parquet scalar reader does not support column amount" even though the old parquet reader and the decoded decimal serde already support Decimal256 for the common parquet decimal physical carriers. Remove the extra precision-based block so Decimal256 columns can use the existing record reader and decoded serde path, while unsupported physical types remain rejected.
### Release note
Support Decimal256 parquet columns in the new parquet reader.
### Check List (For Author)
- Test: Manual test
- Ran build-support/clang-format.sh be/src/format_v2/parquet/parquet_type.cpp
- Ran git diff --check
- Behavior changed: Yes (new parquet reader now accepts parquet decimal columns with precision greater than 38 when they fit Doris Decimal256)
- Does this need documentation: No
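As a rough illustration of the precision rule, the sketch below maps a decimal precision to a decimal width. Only the "precision greater than 38 resolves to Decimal256" step comes from the commit message; the enum `DecimalKind`, the function `resolve_decimal_kind`, the narrower tiers, and the assumed upper limit of 76 digits for Decimal256 are illustrative assumptions.

```cpp
#include <cassert>
#include <optional>

// Hypothetical enum; Doris uses its own type identifiers such as TYPE_DECIMAL256.
enum class DecimalKind { Decimal32, Decimal64, Decimal128, Decimal256 };

// Assumed precision limits per width: 9 / 18 / 38 / 76 digits.
inline std::optional<DecimalKind> resolve_decimal_kind(int precision) {
    assert(precision >= 1); // a decimal precision must be positive
    if (precision <= 9) return DecimalKind::Decimal32;
    if (precision <= 18) return DecimalKind::Decimal64;
    if (precision <= 38) return DecimalKind::Decimal128;
    if (precision <= 76) return DecimalKind::Decimal256;
    return std::nullopt; // wider than any decimal type modeled here
}
```

Under these assumptions a column declared with precision 40 resolves to `Decimal256` and should then be allowed through the record reader rather than rejected by a separate precision check.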
### What problem does this PR solve?
Issue Number: close #xxx
Related PR: apache#63893
Problem Summary: The new parquet reader created the FilteredRowsByLazyRead profile counter but did not pass it to the scan scheduler or update it after predicate filtering. As a result, lazy materialization profile tests saw zero filtered lazy-read rows even when non-predicate columns were read lazily. Pass the counter into ParquetScanScheduler and update it by the number of rows filtered by conjuncts when non-predicate columns are present.
### Release note
None
### Check List (For Author)
- Test: Manual test
- Ran build-support/clang-format.sh be/src/format_v2/parquet/parquet_scan.h be/src/format_v2/parquet/parquet_scan.cpp be/src/format_v2/parquet/parquet_reader.cpp
- Ran git diff --check
- Behavior changed: No
- Does this need documentation: No
### What problem does this PR solve?
Issue Number: close #xxx
Related PR: apache#63893
Problem Summary: Clean up format_v2 code style after the new parquet reader changes. This change fixes local tidy-style issues in the parquet selection vector by using std::cmp_greater/std::cmp_greater_equal for mixed signed and unsigned comparisons and designated initializers for RowRange construction. It also removes unused include directives from format_v2 implementation files. The format_v2 directory was checked with the repository clang-format script, and the include cleanup was validated by compiling the Fedora DEBUG Format target.
### Release note
None
### Check List (For Author)
- Test: Manual test
- Ran build-support/clang-format.sh be/src/format_v2
- Ran git diff --check
- Ran targeted clang-tidy on Fedora with checks modernize-use-integer-sign-comparison and modernize-use-designated-initializers for be/src/format_v2/parquet/selection_vector.h using the DEBUG compile database without PCH
- Ran clang-include-cleaner on selected format_v2 implementation files to collect remove candidates
- Ran /home/socrates/ldb_toolchain/bin/ninja Format in /home/socrates/code/doris/be/build_DEBUG on Fedora
- Behavior changed: No
- Does this need documentation: No
### What problem does this PR solve?
Issue Number: close #xxx
Related PR: apache#63893
Problem Summary: TeamCity external regression build 970191 still had several expected output files using old timestamp values. The new parquet timestamp semantics return the corrected values for the affected external table cases, including Hive, Iceberg, Paimon, and TVF parquet result files. This commit refreshes the corresponding regression expected outputs from the observed CI results and keeps unrelated non-timestamp failures untouched.
### Release note
None
### Check List (For Author)
- Test: Manual test
- Compared TeamCity build 970191 failure details with the updated expected output files. Full regression test was not rerun locally.
- Behavior changed: No
- Does this need documentation: No
Struct type should use name mode with nested type
### What problem does this PR solve?
Issue Number: close #xxx
Related PR: #xxx
Problem Summary: New parquet profile definitions and wiring were split across ParquetReader, ParquetScan, and column reader headers. This made ParquetReader own counter initialization, pruning counter updates, and scheduler sub-profile assembly directly even though parquet_profile.h already existed for profile-related types. This change centralizes the new parquet RuntimeProfile counter ownership in parquet_profile.h/.cpp and keeps ParquetReader responsible only for invoking the profile helper methods.
### Release note
None
### Check List (For Author)
- Test: Manual test
- Ran build-support/clang-format.sh for touched files.
- Ran git diff --check.
- Tried ./run-be-ut.sh --run '--filter=NewParquetReaderTest.*', but local CMake compiler detection failed before building Doris because /opt/homebrew/opt/llvm@16/bin/clang++ could not link a simple program: ld: library 'c++' not found.
- Behavior changed: No
- Does this need documentation: No
### What problem does this PR solve?
Issue Number: close #xxx
Related PR: #xxx
Problem Summary: Align format_v2 implementation namespaces with the format_v2 ownership boundary. Parquet, Hive, Paimon, Iceberg, and JDBC implementations now live under doris::format subnamespaces, while shared format_v2 expression helpers live under doris::format. Call sites and tests were updated to use the new namespace layout.
### Release note
None
### Check List (For Author)
- Test: Manual test
- Ran build-support/clang-format.sh on the modified BE files
- Ran git diff --check
- Ran namespace residue scans for old doris::parquet/hive/paimon/jdbc/iceberg namespaces and duplicate format::format references
- Attempted targeted BE UT with ./run-be-ut.sh --run '--filter=NewParquetReaderTest.*:ParquetColumnReaderTest.*:TableReaderTest.*:CastTest.*:DeletePredicateTest.*:EqualityDeletePredicateTest.*', but local CMake compiler detection failed before Doris code compiled because /opt/homebrew/opt/llvm@16/bin/clang++ could not link libc++: ld: library 'c++' not found
- Behavior changed: No
- Does this need documentation: No
### What problem does this PR solve?
Issue Number: close #xxx
Related PR: apache#63893
Problem Summary: The new parquet reader reports timestamp values with the updated INT96 timestamp interpretation for existing external parquet coverage. This commit updates the affected regression expected outputs from the latest TeamCity P0 and external regression real outputs. Doris parquet export/write cases with suspicious timestamp offsets are intentionally excluded because those require separate writer-side analysis.
### Release note
None
### Check List (For Author)
- Test: Manual test
- Validated modified expected rows against TeamCity builds 970619 and 970620 failure logs, and ran `git diff --check`.
- Behavior changed: No
- Does this need documentation: No
### What problem does this PR solve?
Issue Number: close #xxx
Related PR: apache#63893
Problem Summary: The new parquet reader did not map TIMESTAMP(NANOS) logical columns to a supported Doris timestamp type, and DATETIMEV2 decoded INT64 timestamp values only handled millis and micros. As a result, Hive parquet timestamp nanos data was materialized as NULL instead of the expected timestamp values. This change maps parquet timestamp nanos to DATETIMEV2(6), decodes nanos by truncating to microseconds, and adds decoded-value coverage for DATETIMEV2 nanos. It also refreshes the external TVF group4 expected output for a parquet file containing BC timestamp values that Doris cannot represent, where the new reader correctly returns NULL for those rows.
### Release note
None
### Check List (For Author)
- Test: Manual test
- Ran `git diff --check`.
- Verified the relevant parquet files with DuckDB to confirm timestamp nanos and BC timestamp source values.
- Attempted `./run-be-ut.sh --run '--filter=DataTypeSerDeDecodedValuesTest.*'`, but local CMake failed before compiling tests because the macOS toolchain cannot link a simple C++ program: `ld: library 'c++' not found`.
- Behavior changed: Yes. The new parquet reader now reads TIMESTAMP(NANOS) values as DATETIMEV2(6) instead of producing NULL through unsupported conversion.
- Does this need documentation: No
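The unit handling can be sketched as a normalization of INT64 timestamps to microseconds. `TimeUnit` and `to_epoch_micros` are hypothetical names. The commit message only says nanos are truncated to microseconds; rounding toward negative infinity for pre-1970 values is an assumption made here, and the real decoder may truncate toward zero instead.

```cpp
#include <cassert>
#include <cstdint>

// Parquet INT64 timestamp units handled by this sketch.
enum class TimeUnit { Millis, Micros, Nanos };

// Normalizes an INT64 timestamp to microseconds since the Unix epoch.
// Overflow for extreme millisecond values is not handled in this sketch.
inline int64_t to_epoch_micros(int64_t value, TimeUnit unit) {
    switch (unit) {
    case TimeUnit::Millis:
        return value * 1000;
    case TimeUnit::Micros:
        return value;
    case TimeUnit::Nanos: {
        // Drop the sub-microsecond digits. Assumption: floor division, so
        // negative values round toward negative infinity.
        int64_t micros = value / 1000;
        if (value % 1000 < 0) {
            --micros;
        }
        return micros;
    }
    }
    assert(false && "unknown time unit");
    return 0;
}
```

For example, 1999 ns becomes 1 µs, and with the floor assumption -1 ns becomes -1 µs.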
### What problem does this PR solve?
Issue Number: close #xxx
Related PR: #xxx
Problem Summary: Refine the new Parquet reader row group pruning flow so scan range filtering is applied before more expensive statistics, dictionary, bloom filter, and page index pruning. Also document the Parquet reader, scan scheduler, statistics pruning, and nested column reader APIs, and update affected namespace references in BE tests.
### Release note
None
### Check List (For Author)
- Test: Manual test
- Ran build-support/clang-format.sh on touched BE C++ files and git diff --check locally.
- Started BE UT validation on Fedora with NewParquetReaderTest.* and ParquetBloomFilterPruningTest.*; fixed compile issues found during validation. Full rerun was interrupted before completion by follow-up history cleanup request.
- Behavior changed: No
- Does this need documentation: No
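The ordering change in the row group pruning commit above can be sketched as a two-stage filter: a cheap scan-range check first, then the more expensive pruners only for the survivors. All names here (`RowGroupInfo`, `ScanRange`, `ExpensivePruner`, `select_row_groups`) are hypothetical, and assigning a row group to the range that contains its midpoint is an assumption rather than something the commit message states.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

struct RowGroupInfo {
    int64_t file_offset;     // first byte of the row group in the file
    int64_t compressed_size; // total compressed bytes of the row group
};

struct ScanRange {
    int64_t start;
    int64_t size;
};

// Returns true when the row group must still be read.
using ExpensivePruner = std::function<bool(const RowGroupInfo&)>;

// Cheap check: does the row group's midpoint fall inside the scan range?
inline bool in_scan_range(const RowGroupInfo& group, const ScanRange& range) {
    assert(group.compressed_size >= 0 && range.size >= 0);
    const int64_t midpoint = group.file_offset + group.compressed_size / 2;
    return midpoint >= range.start && midpoint < range.start + range.size;
}

// Applies the range filter first so statistics, dictionary, bloom filter,
// and page index style pruners only run on row groups this scan owns.
inline std::vector<size_t> select_row_groups(const std::vector<RowGroupInfo>& groups,
                                             const ScanRange& range,
                                             const std::vector<ExpensivePruner>& pruners) {
    std::vector<size_t> kept;
    for (size_t i = 0; i < groups.size(); ++i) {
        if (!in_scan_range(groups[i], range)) {
            continue; // skipped before any expensive work
        }
        bool keep = true;
        for (const auto& pruner : pruners) {
            if (!pruner(groups[i])) {
                keep = false;
                break;
            }
        }
        if (keep) {
            kept.push_back(i);
        }
    }
    return kept;
}
```

The benefit of this order is that a split covering one row group out of many never pays for metadata-based pruning on the row groups it does not own.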
Rewrite comments for the entry-point and foundational modules:

parquet_reader.h:
- Class-level doc: role boundary, lifecycle (init→get_schema→open→get_block→close)
- TableReader calling relationship explained
- Each method and field annotated

parquet_type.h:
- ParquetExtraTypeInfo: each variant documented
- ParquetTypeDescriptor: full field-by-field descriptions
- Three-level resolution priority (logical→converted→physical) explained
- resolve_parquet_type / supports_record_reader / decoded_value_kind docs

parquet_column_schema.h:
- Class-level doc: design decisions (wrapper folding, nullable, Dremel levels)
- All fields grouped into sections (identifier / type / levels / children)
- Each field annotated with its role and valid domain (PRIMITIVE vs complex)

parquet_column_schema.cpp:
- SchemaBuildContext fields annotated

parquet_file_context.cpp:
- DorisRandomAccessFile adapter class documented

parquet_profile.h:
- All Profile structs with section-based Chinese comments
- Counter groups organized (RG pruning / page skip / batch read / column reader / file ops / decompress & cache / decode / others)

Co-Authored-By: Claude <noreply@anthropic.com>
### What problem does this PR solve?
Issue Number: close #xxx
Related PR: #xxx
Problem Summary: The new Parquet scanner does not implement condition cache yet, so the parquet condition cache regression case can fail when the session uses FileScannerV2. Force this case to use the old file scanner path before enabling condition cache and profile checks.
### Release note
None
### Check List (For Author)
- Test: No need to test
- Regression-only session variable adjustment for an existing case.
- Behavior changed: No
- Does this need documentation: No
TPC-H: Total hot run time: 29062 ms
TPC-DS: Total hot run time: 171372 ms
ClickBench: Total hot run time: 25.15 s
8b8629d to 9562bbb
run buildall
TPC-H: Total hot run time: 28727 ms
run buildall
TPC-DS: Total hot run time: 171748 ms
ClickBench: Total hot run time: 25.82 s
TPC-H: Total hot run time: 29046 ms
TPC-DS: Total hot run time: 171255 ms
ClickBench: Total hot run time: 25.23 s
run buildall
TPC-H: Total hot run time: 29321 ms
TPC-DS: Total hot run time: 171835 ms
ClickBench: Total hot run time: 25.33 s